
Conversation

@sychen52 (Contributor) commented Jan 9, 2026

What does this PR do?

Type of change: new feature

Overview:

Usage

cd ./examples/llm_ptq/
python hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3-4B \
    --export_path /home/scratch.shiychen_coreai/quantized_models/Qwen3-4B-svdq \
    --qformat nvfp4_awq_svdquant --kv_cache_qformat none --sparsity_fmt dense --calib_size 8

Testing

Exported the quantized checkpoint and loaded it back.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features
    • Added nvfp4_svdquant as a new quantization format option for LLM quantization workflows.
  • Limitations
    • Multi-GPU export configurations using tensor or pipeline parallelism are not supported with nvfp4_svdquant quantization.


@sychen52 sychen52 requested review from a team as code owners January 9, 2026 05:16
@sychen52 sychen52 force-pushed the svdquant branch 2 times, most recently from a788b53 to 34e75e5, on January 9, 2026 05:22
codecov bot commented Jan 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.23%. Comparing base (c1956b8) to head (dc02325).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #754   +/-   ##
=======================================
  Coverage   74.22%   74.23%           
=======================================
  Files         192      192           
  Lines       19035    19038    +3     
=======================================
+ Hits        14129    14132    +3     
  Misses       4906     4906           


@jingyu-ml (Contributor) left a comment:

LGTM overall, including the approach for fusing the QKV and FFN layers. The current resmooth + refusion process means the resulting model is not exactly identical to the original, but this appears to be the only viable option at the moment unless we can fuse these layers during calibration...
Thank you for your work!
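
As a side note for readers, here is a minimal sketch of the resmooth idea, assuming each fused module carries an input_quantizer with a pre_quant_scale buffer (as in the review snippets quoted below). The averaging rule and the resmooth_fused_modules name are illustrative assumptions, not the committed code:

import torch

# Illustrative sketch only: fused projections (e.g. q/k/v) must share one
# input pre_quant_scale, so the per-module scales are averaged and each
# module is updated to the shared value.
def resmooth_fused_modules(modules):
    avg = torch.stack(
        [m.input_quantizer.pre_quant_scale for m in modules]
    ).mean(dim=0)
    for module in modules:
        if not torch.equal(module.input_quantizer.pre_quant_scale, avg):
            # modelopt's _update_pre_quant_scale also folds the scale change
            # back into the module weight; shown here as a plain assignment
            module.input_quantizer.pre_quant_scale = avg

The "refusion" step then re-fuses the consistently smoothed weights, which is why the result is close to, but not bit-identical with, the original model.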

def svd(weight, rank):
    original_device = weight.device
    original_dtype = weight.dtype
    weight_f64 = weight.to(dtype=torch.float64, device=original_device)
Collaborator:

do we need f64?

@sychen52 (author) replied:

I am not sure. I kept what @jingyu-ml had originally. This part is just a refactoring so that I can reuse this code during QKV fusion.

@meenchen (Contributor) left a comment:

LGTM overall

for module in modules:
    if not torch.equal(module.input_quantizer.pre_quant_scale, avg_prequant_scale):
        _update_pre_quant_scale(module, avg_prequant_scale)
if hasattr(modules[0].weight_quantizer, "svdquant_lora_a"):
Contributor:

Can svdquant_lora_a be None in any case?

@sychen52 (author) replied:

Good point. I will skip it when it is None.
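
A minimal sketch of that guard, following the loop quoted above (the getattr form is illustrative, not necessarily the committed code):

lora_a = getattr(modules[0].weight_quantizer, "svdquant_lora_a", None)
if lora_a is not None:
    # only process the SVD low-rank factors when they are actually set
    ...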

@meenchen (Contributor) commented:

Do we have unit tests for svd quant?

@sychen52 (author) replied:

> Do we have unit tests for svd quant?

I think we have unit tests for svdquant, but not for this export part.
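
For reference, a rough sketch of what an export-path test could look like, assuming the mtq.NVFP4_SVDQUANT_DEFAULT_CFG added in this PR and the existing export_hf_checkpoint entry point; the tiny_causal_lm fixture and the one-batch calibration loop are placeholders, not actual project fixtures:

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

def test_nvfp4_svdquant_export(tmp_path, tiny_causal_lm):  # hypothetical fixture
    def forward_loop(model):
        # a single dummy batch stands in for real calibration data
        model(torch.randint(0, 128, (1, 32)))

    model = mtq.quantize(tiny_causal_lm, mtq.NVFP4_SVDQUANT_DEFAULT_CFG, forward_loop)
    export_hf_checkpoint(model, export_dir=tmp_path)
    assert (tmp_path / "config.json").exists()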

coderabbitai bot commented Jan 16, 2026

📝 Walkthrough

This change introduces support for NVFP4 SVDQUANT (SVD-based quantization) throughout the modelopt export pipeline. It adds configuration options, defines a new quantization constant, extends quantization utilities to recognize and process SVD quantization, adds an SVD computation helper, and updates export logic to handle this quantization type consistently with existing variants.

Changes

  • Configuration & Example Scripts (examples/llm_ptq/hf_ptq.py, examples/llm_ptq/scripts/huggingface_example.sh): Added nvfp4_svdquant to the quantization configuration choices, mapped to mtq.NVFP4_SVDQUANT_DEFAULT_CFG; added a runtime guard raising NotImplementedError for multi-GPU export; updated the shell script to accept nvfp4_svdquant in qformat validation.
  • Export Model Configuration (modelopt/torch/export/model_config.py): Introduced the QUANTIZATION_NVFP4_SVDQUANT constant and updated hidden_size calculations to include this quantization type alongside QUANTIZATION_NVFP4 and QUANTIZATION_NVFP4_AWQ for both MoE and non-MoE branches.
  • Quantization Utilities & Post-processing (modelopt/torch/export/quant_utils.py, modelopt/torch/export/postprocess.py): Extended quantization format detection and scaling-factor retrieval to recognize SVDQUANT; added an _update_svdquant helper to recompute pre_quant_scale and LoRA weights and rebuild quantizer statistics; updated TP-merge logic to treat SVDQUANT equivalently to NVFP4_AWQ for weight scaling-factor updates.
  • Unified Export (modelopt/torch/export/unified_export_hf.py): Expanded conditional logic to include QUANTIZATION_NVFP4_SVDQUANT in pre-quant fusion, MoE expert processing, and weight export paths (transposition and quantization steps).
  • Core Quantization Implementation (modelopt/torch/quantization/model_calib.py): Added an svd(weight, rank) helper that computes a truncated SVD in double precision and returns the singular vectors (see the sketch after this list); refactored the svdquant postprocess to use this helper and directly assign vt and us to svdquant_lora_a and svdquant_lora_b.
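
Based on that description and the snippet quoted earlier in the review, a plausible sketch of the helper (the truncation details and return order are inferred, not copied from the PR):

import torch

def svd(weight, rank):
    # compute the SVD in float64 for numerical stability, then return
    # rank-truncated factors cast back to the original dtype and device
    original_device = weight.device
    original_dtype = weight.dtype
    weight_f64 = weight.to(dtype=torch.float64, device=original_device)
    u, s, vt = torch.linalg.svd(weight_f64, full_matrices=False)
    us = u[:, :rank] * s[:rank]  # fold singular values into the left factor
    return us.to(original_dtype), vt[:rank, :].to(original_dtype)

The low-rank product us @ vt approximates the component pulled out of the weight before 4-bit quantization, which appears to be what the svdquant_lora_a/svdquant_lora_b buffers store.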

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 70.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)
  • Description Check ✅: Check skipped — CodeRabbit's high-level summary is enabled.
  • Title Check ✅: The title accurately describes the main change: adding support for exporting svdquant-quantized checkpoints to Hugging Face format. It is specific, clear, and directly related to the primary objective of the PR.


@coderabbitai bot left a comment:

Actionable comments posted: 1

♻️ Duplicate comments (1)
modelopt/torch/export/quant_utils.py (1)

498-500: Fix SVDQuant detection to check actual LoRA buffers (not just attribute presence).

hasattr(weight_quantizer, "svdquant_lora_a") is true even when the buffer is unset, so non‑SVD NVFP4 quantizers could be misclassified as SVDQUANT. Use a value check (and include svdquant_lora_b) to avoid false positives.

🛠️ Proposed fix
-            if input_quantizer is not None and hasattr(weight_quantizer, "svdquant_lora_a"):
-                return QUANTIZATION_NVFP4_SVDQUANT
+            if (
+                input_quantizer is not None
+                and getattr(weight_quantizer, "svdquant_lora_a", None) is not None
+                and getattr(weight_quantizer, "svdquant_lora_b", None) is not None
+            ):
+                return QUANTIZATION_NVFP4_SVDQUANT

Comment on lines +510 to +513

if (
    args.inference_tensor_parallel != 1 or args.inference_pipeline_parallel != 1
) and args.qformat == "nvfp4_svdquant":
    raise NotImplementedError("Svdquant does not support mulitple GPUs yet.")

⚠️ Potential issue | 🟡 Minor

Fix typo in error message.

"mulitple" should be "multiple".

✏️ Proposed fix
             if (
                 args.inference_tensor_parallel != 1 or args.inference_pipeline_parallel != 1
             ) and args.qformat == "nvfp4_svdquant":
-                raise NotImplementedError("Svdquant does not support mulitple GPUs yet.")
+                raise NotImplementedError("Svdquant does not support multiple GPUs yet.")
